Abstract

Metrics specifying distances between data points can be learned in a discriminative manner or from
generative models. In this paper, we show how to unify generative and discriminative learning of
metrics via a kernel learning framework. Specifically, we learn local metrics optimized from parametric
generative models. These are then used as base kernels to construct a global kernel that minimizes a
discriminative training criterion. We consider both linear and nonlinear combinations of local metric
kernels. Our empirical results show that these combinations significantly improve performance on
classification tasks. The proposed learning algorithm is also very efficient, achieving an
order-of-magnitude speedup in training time compared to previous discriminative baseline methods.



1 Introduction 

Metric learning - learning how to specify distances between data points - has been a topic of much interest
in machine learning recently. For example, discriminative techniques for metric learning aim to improve
the performance of a classifier, such as the k-nearest neighbor classifier, on a training set. As a general
strategy, these techniques try to reduce the distances between data points belonging to the same class while
increasing the distances between data points from different classes [1, 2, 3, 4, 5, 6, 7, 8]. In this framework,
a Mahalanobis metric is parameterized by a positive (semi)definite matrix, and metric learning is performed
using semidefinite programming (SDP) involving constraints between pairs or triplets of data points in the
training set [9].

In the asymptotic limit, the performance of nearest neighbor classifiers approaches a theoretical limit,
bounded by twice the Bayes optimum error rate, which is independent of the underlying metric used [10].
Only in the finite sampling case does the performance of a nearest neighbor classifier depend upon the
choice of a metric, and [11] showed how the resulting bias term can be estimated using simple class-conditional
generative models fit to the data. A "generative local metric" (GLM) is then optimized to minimize this
bias term.

However, the local metric learning algorithm has several shortcomings. First, a local metric needs to 
be computed at every point, and it is difficult to calculate the geodesic distance between pairs of distant 
points. It is also unclear how to correlate the choice of generative models with discriminative classifier 
performance. 

In this paper, we address these issues by combining the learned local metrics in a global discriminative
kernel, thus reducing the computational costs for classifying points. Our approach can be viewed as using
metric learning to define base kernels which are then combined discriminatively [12, 13, 14, 15, 16]. The
base kernels are derived from parametric generative models, thus reaping the benefits of both generative
and discriminative models [17, 18]. We show how both simple linear and nonlinear combinations result
in a highly discriminative global kernel that outperforms competing methods significantly on a number of
machine learning datasets. Moreover, we show that our approach is also computationally more efficient
than those methods, often achieving orders of magnitude speedup in training time.

The paper is organized as follows. In section 2 we review previous discriminative and generative
metric learning techniques. We describe our approach of combining local metrics trained from generative
models in section 3. We present extensive empirical studies in section 4, followed by a discussion of our
method and future directions in section 5.

The Appendix for this paper includes details of derivations and implementation, more comprehensive
empirical results, and applications of our approaches to unsupervised learning problems.



2 Background 

Here we briefly review techniques for learning metrics. We start with discriminative metric learning, using
the large margin nearest neighbor (LMNN) classifier as an illustrative example [5]. Next, we examine
learning a generative local metric (GLM) [11], which exploits information from parametric generative
models and does not explicitly attempt to minimize classification errors.

2.1 Discriminative learning metric 

Consider a nearest neighbor classifier which labels a D-dimensional data point $x \in \mathbb{R}^D$ by the
label(s) of its nearest neighbor(s) $x^{nn} \in \mathbb{R}^D$ in a supervised training set $\mathcal{D}$. In
order to identify the "nearest" neighbors, distances from $x$ to data points in $\mathcal{D}$ need to be determined.

The conventional Euclidean distance, $\|x - x'\|_2^2$, is a special case of the more general Mahalanobis
distance

$$d_M^2(x, x') = (x - x')^\top M (x - x') \qquad (1)$$

when the Mahalanobis metric $M \in \mathbb{R}^{D \times D}$ is equal to the D-dimensional identity matrix. In
this paper, we follow the popular terminology in the metric learning literature and refer to the squared
distance simply as the "distance."

For a general positive semidefinite matrix $M$, we can factor it as $M = L^\top L$. This implies a general
Mahalanobis metric can be interpreted as Euclidean distance in a transformed space, $x \to Lx$:

$$d_M^2(x, x') = (Lx - Lx')^\top (Lx - Lx') = \|Lx - Lx'\|_2^2 \qquad (2)$$
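The equivalence between eqs. (1) and (2) can be checked numerically. The following is a minimal sketch, assuming NumPy; the function names `mahalanobis_sq` and `factor_metric` and the example values are illustrative and do not come from the paper. The factor $L$ is taken to be the symmetric square root of $M$, which is one of many valid choices.

```python
import numpy as np

def mahalanobis_sq(x, x_prime, M):
    """Squared Mahalanobis distance (x - x')^T M (x - x') of eq. (1)."""
    diff = np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)
    return float(diff @ M @ diff)

def factor_metric(M):
    """Return a matrix L with L^T L = M for a symmetric PSD matrix M.

    The symmetric square root is used here; any other factor (e.g. Cholesky)
    would give the same distances.
    """
    eigvals, eigvecs = np.linalg.eigh(M)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative round-off
    return eigvecs @ np.diag(np.sqrt(eigvals)) @ eigvecs.T

# Illustrative values only (not from the paper).
M = np.array([[2.0, 0.5], [0.5, 1.0]])
x = np.array([1.0, 2.0])
x_prime = np.array([0.0, -1.0])

L = factor_metric(M)
d_metric = mahalanobis_sq(x, x_prime, M)              # eq. (1)
d_euclid = float(np.sum((L @ x - L @ x_prime) ** 2))  # eq. (2)
```

Both `d_metric` and `d_euclid` evaluate the same quantity, so nearest-neighbor search under $M$ can be carried out with ordinary Euclidean distances after mapping every point through $L$.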

Arguably, the performance of nearest neighbor classifiers depends critically on the metric $M$. A good
$M$ should intuitively "pull" data points in the same class closer and "push" data points in different classes
away. This is the general criterion for most discriminative methods for metric learning [1, 2, 3, 4, 5].
For example, the large margin nearest neighbor (LMNN) classifier casts the learning of $M$ as a convex
optimization problem. For any point $x_i$ in the training set, it differentiates two sets of neighboring data
points: "target" points $x_i^+$ whose labels are the same as that of $x_i$ and "impostor" points $x_i^-$ whose
labels are different from that of $x_i$. LMNN identifies the optimal $M$ as the solution to

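$$\min_{M \succeq 0,\ \xi \geq 0} \ \sum_i \sum_{j \in x_i^+} d_M^2(x_i, x_j) + \gamma \sum_{ijl} \xi_{ijl}$$
$$\text{subject to} \quad 1 + d_M^2(x_i, x_j) - d_M^2(x_i, x_l) \leq \xi_{ijl}, \quad \forall\, j \in x_i^+,\ l \in x_i^- \qquad (3)$$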



where the objective function balances two forces: pulling the targets towards $x_i$ and pushing the impostors
away, so that the distance to an impostor is greater than the distance to a target by a minimum
margin of one. Violations of this margin are absorbed by the slack variables $\xi_{ijl}$.

Note that this formulation of LMNN makes no assumptions on how the (training) data is distributed.
Additionally, the optimization criterion is directly related to how the learned metric will be used for
classification. We see that this approach contrasts sharply with the generative model approach which we
describe next.
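To make the two forces in the LMNN objective concrete, the sketch below evaluates the objective of eq. (3) for a fixed metric. It is not an LMNN solver: the slack for each triplet is simply set to the smallest value that satisfies the constraint, and the function name `lmnn_objective`, the dictionary layout of `targets` and `impostors`, and the toy data are assumptions made for illustration.

```python
import numpy as np

def lmnn_objective(M, X, targets, impostors, gamma=1.0):
    """Evaluate the objective of eq. (3) for a fixed metric M.

    targets[i] and impostors[i] list the indices of the target and impostor
    points of X[i]. For each triplet (i, j, l) the slack is the smallest value
    satisfying the constraint: max(0, 1 + d(x_i, x_j) - d(x_i, x_l)).
    """
    def dist(a, b):
        diff = X[a] - X[b]
        return float(diff @ M @ diff)

    pull, push = 0.0, 0.0
    for i in range(len(X)):
        for j in targets[i]:
            pull += dist(i, j)  # pull targets towards x_i
            for l in impostors[i]:
                push += max(0.0, 1.0 + dist(i, j) - dist(i, l))  # margin violation
    return pull + gamma * push

# Toy data (illustrative): points 0 and 1 share a label, point 2 is an impostor.
X = np.array([[0.0, 0.0], [1.0, 0.0], [1.5, 0.0]])
targets = {0: [1], 1: [0], 2: []}
impostors = {0: [2], 1: [2], 2: []}
value = lmnn_objective(np.eye(2), X, targets, impostors, gamma=1.0)
```

In this toy configuration only the triplet anchored at point 1 violates the margin, because the impostor is closer to it than its target is.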



2.2 Generative learning metric 

Here we consider a binary classification problem with labels $y = 1, 2$, and assume the $N$ training data
points are drawn from two class conditional distributions $p_1(x) = p(x|y=1)$ and $p_2(x) = p(x|y=2)$.
In the asymptotic limit, $N \to \infty$, the error rate of the nearest neighbor classifier is given in terms of the
class conditional distributions¹

$$\epsilon_\infty = \int \frac{p_1(x)\, p_2(x)}{p_1(x) + p_2(x)}\, dx \qquad (4)$$

The asymptotic error $\epsilon_\infty$ can be easily shown to be invariant to a linear transformation of
variables: $z = Lx$. This implies that learning a different metric $M$ in eq. (2) should have no effect on the
error rate in the asymptotic limit.

The solution to this apparent paradox is described in [11], which showed that when the number of
training points is finite, the error rate of the nearest neighbor classifier deviates from the asymptotic error
rate $\epsilon_\infty$ by a finite bias term,

$$\epsilon_N \simeq \epsilon_\infty + c_N \int \frac{p_1 p_2 (p_2 - p_1)}{(p_1 + p_2)^2} \left[ \frac{\nabla^2 p_1}{p_1} - \frac{\nabla^2 p_2}{p_2} \right] dx \qquad (5)$$

where the constant factor $c_N$ tends to zero as $N$ approaches infinity, and the scalar Laplacian $\nabla^2 p(x)$
is the trace of the Hessian $\nabla\nabla p(x)$.

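This bias term does depend upon the choice of the underlying metric, and under a linear transformation
$z = Lx$, the bias term is given by the integral of

$$\frac{p_1 p_2 (p_2 - p_1)}{(p_1 + p_2)^2}\, \mathrm{Trace}\left[ M^{-1} \Psi \right], \quad \text{with} \quad \Psi = \frac{\nabla\nabla p_1(x)}{p_1(x)} - \frac{\nabla\nabla p_2(x)}{p_2(x)} \qquad (6)$$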



The generative local metric (GLM) algorithm optimizes a local metric $M$ to minimize the local bias term
in eq. (6), so that the finite-sample error rate $\epsilon_N$ approaches the asymptotic error rate to first-order
approximation. The resulting optimization with semidefinite constraints,

$$\min_{M_i} \left( \mathrm{Trace}\left[ M_i^{-1} \Psi_i \right] \right)^2 \quad \text{subject to} \quad |M_i| = 1,\ M_i \succeq 0 \qquad (7)$$



is easily solved at each data point $x_i$ using a spectral decomposition. The optimum $M_i^*$ is a positive
semidefinite matrix whose eigenvectors $U_i$ are the same as those of $\Psi_i$. If $\Lambda_+$ is the diagonal
matrix composed of the $d_+$ positive eigenvalues of $\Psi_i$, and $\Lambda_-$ is the corresponding diagonal
matrix with its $d_-$ negative eigenvalues, the solution can be written as

$$M_i^* \propto U_i \begin{pmatrix} d_+ \Lambda_+ & 0 \\ 0 & -d_- \Lambda_- \end{pmatrix} U_i^\top \qquad (8)$$

where the proportionality constant is determined by scaling the determinant of $M_i^*$ to unity. Note that this
learning algorithm does not attempt to reduce the nearest neighbor classification error rate explicitly.

¹ For simplicity, we consider equal prior distributions here. Unequal class priors contribute a more complicated
scalar term in eqs. (4)-(6), but the resulting derivation for the optimization of the local metric is unchanged.
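The spectral solution of the local metric can be sketched for the two-class Gaussian case. This is a minimal illustration rather than the authors' implementation: it assumes NumPy, uses the standard closed form for the Hessian of a Gaussian density, and assumes $\Psi$ has no zero eigenvalues. The function names and the example means and covariances are made up for the demonstration.

```python
import numpy as np

def gaussian_psi_term(x, mean, cov):
    """Hessian of a Gaussian density divided by the density at x:
    Sigma^-1 (x - mu)(x - mu)^T Sigma^-1 - Sigma^-1."""
    cov_inv = np.linalg.inv(cov)
    v = cov_inv @ (x - mean)
    return np.outer(v, v) - cov_inv

def glm_local_metric(x, mean1, cov1, mean2, cov2):
    """Local metric at x for two Gaussian classes, following eqs. (6)-(8).

    Assumes Psi has no zero eigenvalues (otherwise the determinant vanishes).
    """
    psi = gaussian_psi_term(x, mean1, cov1) - gaussian_psi_term(x, mean2, cov2)
    eigvals, U = np.linalg.eigh(psi)
    pos, neg = eigvals > 0, eigvals < 0
    scaled = np.zeros_like(eigvals)
    scaled[pos] = pos.sum() * eigvals[pos]    # d_+ * Lambda_+
    scaled[neg] = -neg.sum() * eigvals[neg]   # -d_- * Lambda_- (positive values)
    M = U @ np.diag(scaled) @ U.T
    M = M / np.linalg.det(M) ** (1.0 / len(x))  # scale the determinant to one
    return M, psi

# Illustrative class models and query point (not from the paper).
x = np.array([1.0, 0.5])
M_local, psi = glm_local_metric(
    x,
    mean1=np.array([0.0, 0.0]), cov1=np.eye(2),
    mean2=np.array([2.0, 0.0]), cov2=np.diag([2.0, 0.5]),
)
bias_trace = float(np.trace(np.linalg.inv(M_local) @ psi))
```

When $\Psi$ has eigenvalues of both signs, as in this example, the rescaling makes the positive and negative contributions to the trace cancel, so `bias_trace` is zero up to round-off.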



3 Discriminative learning with multiple generative metrics 

Prior empirical studies have shown that the generative local metric (GLM) of section 2.2 performs
competitively, even when compared to discriminative methods such as the large margin nearest neighbor
classifier (LMNN) [5]. However, GLM has several shortcomings. First, for every (new) data point $x$ to be
classified, the optimization of eq. (7) needs to be solved. The resulting metric depends on $x$, and distances to
the training data points need to be computed in this specific metric. While specialized data structures can be
exploited to speed up the process of identifying nearest neighbors, these structures usually require a fixed
metric and cannot be easily adapted to a new metric. This can significantly increase the computational
cost at testing time.

Secondly, the performance of GLM depends on the specific form of the class conditional distributions
$p_1(x)$ and $p_2(x)$ used for generative modeling. Initial studies have suggested that even with simplistic
models such as Gaussian distributions, GLM attains robust and competitive performance. Nevertheless,
a rigorous quantification of the relationship between the assumed generative models and the classification
performance is lacking. In particular, it is unclear how the choice of the generative models should be adapted
in order to further improve the classification performance of the nearest neighbor classifier with the learned
metric.

We address these issues by viewing the problem of learning metrics as learning kernels. We then
investigate how to improve classification performance by using these kernels. To this end, we consider two
schemes for learning kernels discriminatively: linear and nonlinear combinations of metrics.



3.1 Linear combination of local metrics 



A metric $M_i$ learned at the training point $x_i$ can be seen as a linear positive semidefinite kernel, defining
the inner product between two points $x_m$ and $x_n$:

$$K_i(x_m, x_n) = x_m^\top M_i\, x_n \qquad (9)$$



Note that while $M_i$ is learned "locally" in the neighborhood of $x_i$, we treat it as a biased estimate of a
global kernel function over the space of all training examples. We arrive at an unbiased estimator of the
global metric (the intuition is made clear below) by linearly combining all the local kernels learned
from the $N$ training samples:

$$K(x_m, x_n) = \sum_i \alpha_i K_i(x_m, x_n) = x_m^\top \Big( \sum_i \alpha_i M_i \Big) x_n = x_m^\top M x_n \qquad (10)$$



where the combination coefficients $\{\alpha_i\}_{i=1}^N$ are constrained to be nonnegative and sum to one,
guaranteeing the resulting kernel to be positive semidefinite. The global metric $M$ is then simply the convex
combination of all local metrics. We now consider the simplest convex combination, uniform averaging:

$$M^{UNI} = \frac{1}{N} \sum_{i=1}^N M_i \qquad (11)$$



As our empirical studies show, this surprisingly simple strategy works well in practice. 

As noted earlier, a positive semidefinite metric may be viewed as applying a linear transformation to
the original data $x$. This implies that $M^{UNI}$ transforms $x$ to a new space where, on average, local
metrics computed in that space are proportional to the identity matrix. Thus, on average, the Euclidean distance
based nearest neighbor classification will perform well in that space. More formally:

Theorem 1. Assume the class conditional distribution $p_c(x) = p(x|y=c)$ is Gaussian for every class.
Let $M_i$ be the local metric computed with eq. (7), minimizing the bias term in the space of $x$. Then, the
uniform convex combination metric $M^{UNI}$ of eq. (11) induces a linear transformation $z = Lx$ where
$L^\top L = M^{UNI}$. Furthermore, let $Q_i$ denote the local metric computed in the space of $z$ under the new
class conditional distribution $p_c(z)$. We have

$$\sum_{i=1}^N Q_i \propto I \qquad (12)$$

where $I$ is the identity matrix.

The proof exploits the fact that $p_c(z)$ is also Gaussian and $Q_i$ can be expressed in closed form
in terms of $M_i$. Details are presented in the Appendix (section A).
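The statement of Theorem 1 can be illustrated numerically. The sketch below assumes NumPy and relies on the closed-form relation $Q_i = |L|^{2/D} L^{-\top} M_i L^{-1}$ used in the proof in Appendix A. The local metrics here are random unit-determinant positive definite matrices standing in for real GLMs, which is an assumption made purely to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 3, 50

def random_unit_det_metric(rng, D):
    """Random positive definite matrix rescaled to determinant one (stand-in for a GLM)."""
    A = rng.normal(size=(D, D))
    M = A @ A.T + 0.1 * np.eye(D)
    return M / np.linalg.det(M) ** (1.0 / D)

local_metrics = [random_unit_det_metric(rng, D) for _ in range(N)]
M_uni = sum(local_metrics) / N          # uniform combination, eq. (11)

# Symmetric square root L of M_uni, so that L^T L = M_uni.
eigvals, U = np.linalg.eigh(M_uni)
L = U @ np.diag(np.sqrt(eigvals)) @ U.T
L_inv = np.linalg.inv(L)
scale = np.linalg.det(L) ** (2.0 / D)

# Local metrics expressed in the transformed space z = L x.
Q = [scale * (L_inv.T @ M_i @ L_inv) for M_i in local_metrics]
Q_sum = sum(Q)
```

`Q_sum` comes out as a multiple of the identity, which is the content of eq. (12): after transforming by $L$, no direction is preferred on average.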



3.2 Nonlinear combination of local metrics 

In order to combine local metrics in a nonlinear fashion, we use Gaussian radial basis function (RBF)
kernels in which the local metric replaces the standard identity covariance matrix,

$$K_{il}(x_m, x_n) = \exp\{ -(x_m - x_n)^\top M_i (x_m - x_n) / \sigma_l^2 \} \qquad (13)$$

where $\sigma_l$ is a bandwidth parameter, chosen from a range of possible values between $\sigma_{min}$ and
$\sigma_{max}$.

Our goal is to learn a convex combination of these RBF base kernels. We follow the standard multiple
kernel learning (MKL) framework, where base kernels are combined as

$$K(x_m, x_n) = \sum_{i,l} \alpha_{il} K_{il}(x_m, x_n) \quad \text{subject to} \quad \alpha_{il} \geq 0,\ \sum_{i,l} \alpha_{il} = 1 \qquad (14)$$

Note that the combined kernel $K(\cdot, \cdot)$ is a highly nonlinear, albeit convex, function of the local
metrics. It is well known that positive semidefinite kernels, including ours in eq. (14), can be represented as
distances in the corresponding reproducing kernel Hilbert space (RKHS) [19]. However, as opposed to the global
metric $M^{UNI}$, we cannot represent this distance (and its associated metric) as a closed-form function of $x$
and $\{M_i\}_{i=1}^N$.

In typical applications of MKL, one often chooses Gaussian RBF kernels with identity covariance
matrices, $\exp(-\|x_m - x_n\|_2^2 / \sigma_l^2)$. This is due to the difficulty in properly choosing non-identity
covariance matrices for the base kernels, especially in high-dimensional problems. Our formulation in eq. (14)
overcomes the challenge by using non-Euclidean metrics computed from generative modeling.

We refine the combination by optimizing $\{\alpha_{il}\}$ discriminatively. Specifically, the coefficients
$\{\alpha_{il}\}$ are adjusted so that the kernel $K(\cdot, \cdot)$ achieves the lowest empirical risk when used
in kernel based classifiers such as support vector machines [13]. In this aspect, our formulation reaps the
benefits of both generative modeling and discriminative training.
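The construction of the base kernels in eq. (13) and their convex combination in eq. (14) can be sketched as follows. This is an illustration only, assuming NumPy: the function names, the two example metrics, and the bandwidths are hypothetical, and the weights are set uniformly rather than learned. Learning the weights discriminatively requires an MKL solver, which is not shown.

```python
import numpy as np

def metric_rbf_kernel(X, M, sigma):
    """Gram matrix with entries exp(-(x_m - x_n)^T M (x_m - x_n) / sigma^2), as in eq. (13).

    M is assumed symmetric positive semidefinite.
    """
    XM = X @ M
    sq = np.sum(XM * X, axis=1)                     # x_m^T M x_m for every row
    d2 = sq[:, None] + sq[None, :] - 2.0 * XM @ X.T
    d2 = np.maximum(d2, 0.0)                        # remove tiny negative round-off
    return np.exp(-d2 / sigma ** 2)

def combined_kernel(X, metrics, sigmas, alpha):
    """Convex combination of base kernels, eq. (14); alpha[i, l] weights metric i, bandwidth l."""
    alpha = np.asarray(alpha, dtype=float)
    if np.any(alpha < 0) or not np.isclose(alpha.sum(), 1.0):
        raise ValueError("alpha must be nonnegative and sum to one")
    K = np.zeros((len(X), len(X)))
    for i, M in enumerate(metrics):
        for l, sigma in enumerate(sigmas):
            K += alpha[i, l] * metric_rbf_kernel(X, M, sigma)
    return K

# Illustrative inputs (not from the paper); uniform weights stand in for learned ones.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
metrics = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
sigmas = [0.5, 2.0]
alpha = np.full((len(metrics), len(sigmas)), 0.25)
K = combined_kernel(X, metrics, sigmas, alpha)
```

Because each base kernel is positive semidefinite and the weights are nonnegative, the combined Gram matrix `K` is positive semidefinite as well and can be passed to any kernel classifier that accepts a precomputed kernel.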



3.3 Convex combination: revisited 

One may wonder why the framework of multiple kernel learning, used for the nonlinear combination of metrics
in section 3.2, is not used to discriminatively optimize the convex combination coefficients of eq. (10). Our
preliminary results indicate that $M^{UNI}$ in general performs well. This is consistent with previous extensive
work on combining kernels linearly: discriminative learning of such combinations does not reliably
outperform simpler combination strategies, including the uniform combination [20, 13]. We present
more experimental details, including other forms of convex combinations, in the Appendix (section B). We
have found that $M^{UNI}$ is both computationally appealing and empirically very effective.
Table 1: Misclassification error rates (in %) on the 8 small-scale datasets.

Method       3-Norm.   Wine    Iris    Heart   Vehi.   Ionos.  Image   German  Avg. Rank
Euclidean    8.17      4.41    5.11    22.78   31.29   15.87   2.68    27.27   5.75
LMNN         4.70      2.61    4.78    21.91   21.97   12.39   2.14    27.20   4.25
GLM^INT      3.83      3.96    3.67    21.60   25.32   6.24    2.89    26.17   3.75
M^UNI        3.44      1.80    3.33    19.51   17.47   9.67    2.47    26.12   1.88
LMNN_E       3.70      2.43    4.67    20.56   20.37   11.97   2.67    26.88   3.25
M^UNI_E      3.10      2.52    3.11    19.26   15.81   10.80   3.01    25.15   2.13


3.4 Computational complexity and optimization 

The computational complexity of our algorithms is dominated by the calculation of local metrics in eq. (7).
The main calculation involved is diagonalizing the matrix $\Psi_i$. For a D-dimensional space $x$, the
computational cost is $O(D^3)$. Since the local metrics are computed at every training sample, the total
computational cost is $O(ND^3)$. Computing $M^{UNI}$ itself adds little overhead.

In contrast, discriminative techniques such as LMNN for learning a single global metric require
iterative numerical optimization. For LMNN, the optimization needs to examine roughly $O(N^2)$
constraints. For very large $N$ and small to moderate $D < \sqrt[3]{N}$, our approaches will greatly outperform
LMNN in speed, as demonstrated later in section 4.

4 Experimental results 

We compare our methods of discriminative kernel learning from generative local metrics (GLMs) to other
competitive methods of metric learning. Here we report the results of applying simple linear (section 3.1)
and nonlinear (section 3.2) combinations to classification. More comprehensive details are included in the
Appendix (sections B-D).

4.1 Setup 

Datasets  We have used 10 datasets: 3-Normal, Iris, Wine, Heart, Vehicle, Ionosphere, Image, German,
MNIST and Letters. The first 8 datasets are small-scale, having 150-2310 data points with dimensionality
ranging from 4 to 34. The number of labelled classes ranges from 2 to 4. The MNIST and Letters datasets
are substantially larger: MNIST has 70,000 deskewed images with 10 classes while Letters has 20,000
examples with 26 labelled classes. 3-Normal is a synthetic set containing a mixture of 3 Gaussians. The other
datasets are downloaded from the UCI machine learning repository [21], the IDA benchmark repository²
and NYU [22].

Data in the small-scale datasets is preprocessed so that the feature vector components range between
-1 and 1. For supervised learning tasks such as classification, each dataset is randomly split 30 times into
training (60%), validation (20%) and testing (20%) sets.
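The preprocessing and splitting protocol for the small-scale datasets can be sketched as below. The paper does not state how the scaling is performed, so per-feature min-max scaling is an assumption here, as are the function names, the random seed, and the placeholder data. NumPy is assumed.

```python
import numpy as np

def scale_to_unit_range(X):
    """Per-feature min-max scaling to [-1, 1] (assumed scheme; constant features map to -1)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    return 2.0 * (X - lo) / span - 1.0

def random_split(n, rng, train_frac=0.6, val_frac=0.2):
    """Return index arrays for a random train/validation/test split of n points."""
    perm = rng.permutation(n)
    n_train = int(round(train_frac * n))
    n_val = int(round(val_frac * n))
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]

# Placeholder data with the size of a 150-point, 4-dimensional dataset.
rng = np.random.default_rng(0)
X = rng.uniform(3.0, 10.0, size=(150, 4))
X_scaled = scale_to_unit_range(X)
splits = [random_split(len(X), rng) for _ in range(30)]  # 30 random splits
train_idx, val_idx, test_idx = splits[0]
```

Reported error rates would then be averages over the 30 splits, with parameters selected on the validation part of each split.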

The MNIST images have a resolution of 784 pixels and are preprocessed with PCA, reducing the
dimensionality to 40, 60 and 164, to save training time and prevent overfitting. We perform
5 random splits, each with 65000 samples for training, 5000 for validation and 10000 for testing. For the
Letters dataset, we perform 10 random splits, each with 12000 samples for training, 2000 for validation
and 6000 for testing. In the Letters-scaled set, the features are scaled to lie within the range [-1, 1]. We
also provide experimental results on the unscaled version (denoted as Letters-original), as the training time
of LMNN is sensitive to the scaling for this dataset.

² http://www.fml.tuebingen.mpg.de/Members/raetsch/benchmark

Learning methods  The various learning methods used in our comparative study are summarized below:

• Euclidean. kNN classifier using Euclidean distances.

• LMNN. kNN classifier using the metric of large margin nearest neighbor [5] (cf. section 2).

• GLM$^{INT}$. kNN classifier using the generative local metrics (GLM) [11] (cf. section 2). Gaussian
distributions are used as the class-conditional distributions for generative modeling. We follow the procedure
in [11] to interpolate GLM with the Euclidean metric. (Un-interpolated GLMs underperform interpolated
ones, a finding which is consistent with what is reported in [11].)

• $M^{UNI}$. kNN classifier using our approach of uniformly combining the un-interpolated GLMs into a
single metric, described by eq. (11).

• LMNN$_E$. Energy-based classification using the metric of LMNN [5].³

• $M_E^{UNI}$. Energy-based classification using the metric $M^{UNI}$.

Tuning  The parameters of all methods are optimized on validation sets, and their overall performance is
reported on test sets. Tunable parameters include the number of (target) nearest neighbors, the interpolation
parameter in GLM$^{INT}$, and the margins used in the two energy-based classifiers. We have used the
LMNN implementation as reported in [5].
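Energy-based classification, as described in the footnote attached to the method list, can be sketched in a simplified form. This version is an assumption-laden illustration: it uses a single nearest neighbor per class and no margin term, whereas the energy used with LMNN in [5] contains further terms. The function name and the toy data are hypothetical, and NumPy is assumed.

```python
import numpy as np

def energy_classify(x, X_train, y_train, M):
    """Assign x to the class with the lowest energy under metric M.

    Simplified energy for class c: (distance to the nearest training point in c)
    minus (distance to the nearest training point in any other class).
    """
    diffs = X_train - x
    d2 = np.einsum('nd,de,ne->n', diffs, M, diffs)  # squared distances under M
    energies = {}
    for c in np.unique(y_train):
        same = d2[y_train == c].min()
        other = d2[y_train != c].min()
        energies[int(c)] = float(same - other)
    label = min(energies, key=energies.get)
    return label, energies

# Toy example (illustrative values only).
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
y_train = np.array([0, 0, 1, 1])
label, energies = energy_classify(np.array([0.5, 0.2]), X_train, y_train, np.eye(2))
```

Passing $M^{UNI}$ or a metric learned by LMNN in place of the identity matrix gives the corresponding energy-based variant of each method.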



4.2 Linear combination of generative local metrics 

Performance on classification tasks  Table 1 displays averaged misclassification error rates (over 30
random splits) for the 8 small-scale datasets. Standard errors are reported in the Appendix (section C). The
last column is the averaged ranking in performance (across 8 datasets); the smaller the number, the better
the performance is on average. GLM$^{INT}$ outperforms LMNN and Euclidean on most sets, though its
performance is surpassed by LMNN$_E$. However, the best performers are $M^{UNI}$ and $M_E^{UNI}$, which use
the simple strategy of uniformly combining the generative local metrics.

Table 2 displays averaged error rates (over 5 random splits for MNIST and 10 for Letters) on the
large-scale datasets. Standard errors are reported in the Appendix (section C). For the MNIST dataset, we report
results on several PCA-preprocessed dimensionalities.⁴ For Letters, we report results on two cases using
scaled and unscaled features.

On the MNIST dataset, it is clear that both $M_E^{UNI}$ and LMNN$_E$ perform better than other methods
when the dimensionality is low ($D \leq 60$). However, at the larger dimensionality of 164, both LMNN and
LMNN$_E$ outperform other methods including our approaches. One possible explanation is that, with the
increased dimensionality, the generative modeling used by both GLM$^{INT}$ and our approaches ($M^{UNI}$ and
$M_E^{UNI}$) does not fit the data properly. On the other hand, discriminative training might be able to overcome
the problem with better regularization.

On the Letters dataset, our approach of $M_E^{UNI}$ clearly outperforms all other methods.

Computational efficiency  Details are presented in the Appendix (section C.1). In summary, we
observe that our methods are computationally efficient, achieving orders of magnitude speedup in training
time. For example, on MNIST-40, LMNN takes about 40 minutes to learn the final metric while our $M^{UNI}$
algorithm takes about 4 minutes.



³ The energy of a point being assigned to a class c is defined as the difference between two quantities: the (sum of)
distances of this point to its nearest neighbor in class c and the (sum of) distances of this point to its nearest
neighbor in classes other than c. A point is assigned to the class label which has the lowest energy. Energy-based
classification can improve performance significantly over kNN classification [5].

⁴ As a comparison point, MNIST-164 has the same dimensionality as the one reported in the work on LMNN [5], with an
error rate of 1.37% using the energy-based classification, denoted as LMNN_E in the table. We obtain a similar error
rate of 1.34%.

Table 2: Misclassification error rates (in %) on the two large-scale datasets.

Method       MNIST-40   MNIST-60   MNIST-164   Letters-scaled   Letters-original
Euclidean    2.09       2.02       2.16        5.05             5.25
LMNN         1.99       1.84       1.82        3.91             3.81
GLM^INT      3.75       3.55       3.48        5.55             5.51
M^UNI        1.93       1.90       4.30        3.04             2.96
LMNN_E       1.53       1.43       1.34        2.98             2.90
M^UNI_E      1.40       1.44       3.13        2.26             2.28

4.3 Nonlinear combination of generative local metrics

We also report the results of a nonlinear combination of generative local metrics, using the framework of
discriminative kernel learning described in section 3.2. The baseline system learns a kernel of the following
form: $K(x, x') = \sum_l \alpha_l \exp\{-\|x - x'\|_2^2 / \sigma_l^2\}$, where $\sigma_l$ is the bandwidth of the
kernel. Our method of discriminative kernel learning replaces the Euclidean distance in the conventional Gaussian
RBF kernel with the generative local metrics (GLMs), as in eq. (13). There are as many local metrics as the
number of training examples. Thus, for our implementation, we use "regional" metrics in eq. (13). Specifically, we
partition the training data into $P$ parts. We average the local metrics of the data points in each part and obtain
$P$ "regional" metrics $\{M_p\}_{p=1}^P$. In the specific case where $P = 1$, we obtain $M^{UNI}$, the uniform
linearly combined metric.

Table 3: Misclassification error rates (in %) with nonlinear combination of metrics.

Method         3-Normal   Wine    Iris    Heart   Vehi.   Ionos.  German
Baseline       6.08       2.16    5.56    18.09   26.53   6.90    25.22
Our approach   2.31       2.43    2.78    17.16   15.44   9.01    24.25


For both the baseline method and our approach of kernel learning with eq. (13), the combination
coefficients are optimized with the SimpleMKL algorithm [23]. Table 3 displays averaged misclassification
error rates on 7 (out of 8) small-scale datasets. Experiments on other datasets are ongoing.

We experiment with $P$ = 1, 5, and 10 and aggregate the results by reporting the best performing
$P$ in Table 3. Further details of our method's performance with different $P$ are provided in the
Appendix, section C. On 5 out of the 7 datasets, nonlinear combination of metrics clearly outperforms the
baseline, with significant improvement on the datasets of 3-Normal, Iris and Vehicle; however, our method
performs poorly on the Ionosphere dataset. Note that in Table 1 the local metrics used alone attain a better
error rate on this dataset (6.24%) than the best nonlinear kernel method (6.90%). Thus, more analysis is still
needed to understand effective methods for nonlinear combinations of local metrics.
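The "regional" metrics used in this section can be sketched as follows. The paper does not specify how the training data is partitioned into $P$ parts, so a uniformly random partition is assumed here; the function name `regional_metrics` and the random stand-in metrics are likewise illustrative. NumPy is assumed.

```python
import numpy as np

def regional_metrics(local_metrics, P, rng):
    """Average local metrics within each of P parts of a random partition (assumed scheme).

    local_metrics has shape (N, D, D); the result has shape (P, D, D). Requires P <= N.
    """
    local_metrics = np.asarray(local_metrics)
    perm = rng.permutation(len(local_metrics))
    parts = np.array_split(perm, P)
    return np.stack([local_metrics[idx].mean(axis=0) for idx in parts])

# Random positive definite matrices standing in for learned local metrics.
rng = np.random.default_rng(0)
A = rng.normal(size=(12, 2, 2))
local = A @ np.transpose(A, (0, 2, 1)) + 0.1 * np.eye(2)

M_regional_1 = regional_metrics(local, 1, rng)   # P = 1 recovers the uniform average
M_regional_5 = regional_metrics(local, 5, rng)
```

Each regional metric then plays the role of $M_i$ in eq. (13), which reduces the number of base kernels from the number of training points to $P$ times the number of bandwidths.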



4.4 Application to unsupervised learning problems 

Many unsupervised learning problems, such as clustering and dimensionality reduction, also rely upon a
proper metric to calculate distances. We have also investigated how to apply algorithms to learn metrics
to such unsupervised problems. One crucial step is to extract discriminative information from unlabeled
data for the algorithms to compute better metrics. To address this issue, we have developed an EM-like
procedure to iterate between inferring labels and computing local and global metrics. Details are presented
in the Appendix (section D).

We applied this procedure to a number of unsupervised learning problems. We achieve significantly
better performance than standard approaches for clustering. Additionally, we can exploit the learned metric
for dimensionality reduction, for instance to learn the nonlinear manifold structure of data. As an
illustrative example, we show the benefits of using the global metric $M^{UNI}$ with the IsoMap algorithm [24].

In particular, we compare the IsoMap embedding results computed with the Euclidean metric and the
results with the $M^{UNI}$ metric on the MNIST dataset. We selected 400 random samples from the digits
'3', '5', and '9', and resized the images to 7 x 7. Fig. 1 plots the two different low-dimensional embeddings
of data samples, colored according to their digit identities. This clearly shows that learning a global metric
helps to discover a better embedding that exhibits clear clustering structure among different class identities.
Figure 1: IsoMap embeddings of images of digits, comparing Euclidean distance with our method of
learning a global metric.



5 Discussion 

In the context of metric learning, we have proposed several new approaches that can reap the benefits of 
both discriminative training and generative modeling. Our method builds upon the connection between a 
kernel learning framework and using learned positive semidefinite metrics from generative models as base 
kernels. Empirical studies validate our algorithms in both improving classification performance across a 
variety of datasets as well as in computational efficiency in implementation. Ongoing work includes further 
investigations into more effective approaches to training nonlinear combinations of learned local metrics 
in a discriminative manner. 
Appendices



A Proof of Theorem 1

Theorem 1. Assume the class conditional distribution $p_c(x) = p(x|y=c)$ is Gaussian for every class. Let $M_i$
be the local metric computed with eq. (7), minimizing the bias term in the space of $x$. Then, the uniformly
combined metric $M^{UNI}$ (cf. eq. (11)) induces a linear transformation $z = Lx$ where $L^\top L = M^{UNI}$.
Furthermore, let $Q_i$ denote the local metric computed in the space of $z$ under the new class conditional
distribution $p_c(z)$. We have

$$\sum_{i=1}^N Q_i \propto I \qquad (15)$$

where $I$ is the identity matrix.

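Proof. Let $\Psi_i$ denote the matrix characterizing the bias term at $x_i$ in the original space (cf. eq. (6)).
For multiway classification, the matrix is given by

$$\Psi = \sum_c \nabla\nabla p_c(x) \Big[ \sum_{c' \neq c} p_{c'}^2(x) - p_c(x) \sum_{c' \neq c} p_{c'}(x) \Big] \qquad (16)$$

where we have dropped the subscript $i$ for clarity. For Gaussian class conditional distributions
$p_c(x) = \mathcal{N}(x | \mu_c, \Sigma_c)$, the Hessian $\nabla\nabla p_c(x)$ is given by

$$\nabla\nabla p_c(x) = p_c(x) \left[ \Sigma_c^{-1} (x - \mu_c)(x - \mu_c)^\top \Sigma_c^{-1} - \Sigma_c^{-1} \right] \qquad (17)$$

Under the linear transformation $z = Lx$, the matrix for the bias term in the new space is now
$\tilde{\Psi}_i \propto L^{-\top} \Psi_i L^{-1}$. We now establish the relationship between $Q_i$, which satisfies

$$\mathrm{Trace}\left[ Q_i^{-1} \tilde{\Psi}_i \right] = 0, \quad |Q_i| = 1, \quad Q_i \succeq 0 \qquad (18)$$

and $M_i$, the solution in the original space

$$\mathrm{Trace}\left[ M_i^{-1} \Psi_i \right] = 0, \quad |M_i| = 1, \quad M_i \succeq 0 \qquad (19)$$

Let $Q_i = |L|^{2/D} L^{-\top} M_i L^{-1}$, where $D$ is the input dimensionality. We claim $Q_i$ is the solution to
eq. (18) as long as $M_i$ is the solution to eq. (19):

$$\mathrm{Trace}\left[ Q_i^{-1} \tilde{\Psi}_i \right] \propto \mathrm{Trace}\left[ L M_i^{-1} L^\top L^{-\top} \Psi_i L^{-1} \right] = \mathrm{Trace}\left[ M_i^{-1} \Psi_i \right] = 0 \qquad (20)$$

and $|Q_i| = |M_i| = 1$, $Q_i \succeq 0$.
Thus, we have

$$\sum_i Q_i = |L|^{2/D} \sum_i L^{-\top} M_i L^{-1} = N |L|^{2/D} L^{-\top} \Big( \frac{1}{N} \sum_i M_i \Big) L^{-1} \qquad (21)$$
$$= N |L|^{2/D} L^{-\top} M L^{-1} \qquad (22)$$

Let the eigen-decomposition of $M$ be $U \Lambda U^\top$, where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_D)$
contains all eigenvalues and $U$ contains all eigenvectors. We set $L = U \Lambda^{1/2} U^\top$, where
$\Lambda^{1/2} = \mathrm{diag}(\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_D})$. Then, $L$ is an induced transformation
from $M$ since $L^\top L = M$. Plugging $L$ into eq. (22), we obtain:

$$\sum_i Q_i = N |L|^{2/D} L^{-\top} M L^{-1} = N |L|^{2/D} I \propto I \qquad (23)$$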
Algorithm 1 Density-based Convex Combination of Local Metrics
1: Compute the local metric $M_i$ for each point $x_i$
2: for iter = 1 to MAXITER do
3:   Estimate the density $p(x)$ for each point
4:   Compute the global metric $M = \sum_i p(x_i) M_i$
5:   Transform each point by $x_i \leftarrow L x_i$, where $L^\top L = M$
6: end for
7: Return $M$



B Linear combination of local metrics: other forms 

The uniform combination of eq. (11), and the resulting metric $M^{UNI}$, is a special case of convex combination
of local metrics. Here, we consider another form of convex combination, where the combination coefficients (or
weights) are proportional to the probability density at each point. The densities are estimated by density
estimators. Typically, a density estimator also depends on the metric used: a better metric can often lead to
better estimation of densities. Thus we propose the iterative procedure listed in Algorithm 1 to jointly estimate
the densities and combine local metrics.

We have experimented with two types of density estimators: 

• Kernel Density Estimator (KDE). Given the training data $\{x_1, \ldots, x_N\}$, the density of a testing point $x_t$ is
defined as $p(x_t) = \frac{1}{h} \sum_{i=1}^{N} \exp(-\|x_t - x_i\|^2 / \sigma^2)$, where $h$ is a normalization constant and $\sigma$ is the
bandwidth parameter, which is tuned on the validation data (to maximize the likelihood).

• Gaussian Mixture Model (GMM). The model is built by modeling each class as a single Gaussian. Note that
the densities calculated from the GMM are metric invariant if the class assignments are fixed. To make the density
metric-dependent, we use the following trick: once the global metric is learnt, we use it to re-classify all training
data (by $k$NN) based on the validation data. After that, we build the GMM according to the new labels, and reiterate
the process.
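
A minimal sketch of Algorithm 1 with the KDE weights may make the iteration concrete. Several choices here are assumptions rather than details from the text: the local metrics are passed in as an array `local_metrics` (how they are computed is not shown), the weights are normalized to sum to one so that the constant $h$ drops out, the bandwidth `sigma` is fixed instead of tuned on validation data, and step 5 is realized by measuring distances under the current $M$ in the original coordinates instead of physically transforming the points.

```python
import numpy as np

def kde_density(X, M, sigma):
    """Unnormalized Gaussian KDE at every row of X, with distances under metric M."""
    diff = X[:, None, :] - X[None, :, :]                 # (N, N, D) pairwise differences
    d2 = np.einsum("ijd,de,ije->ij", diff, M, diff)      # squared distances under M
    return np.exp(-d2 / sigma**2).sum(axis=1)

def density_weighted_metric(X, local_metrics, sigma=1.0, max_iter=20):
    """Density-weighted convex combination of local metrics (sketch of Algorithm 1)."""
    M = np.eye(X.shape[1])                               # start from the Euclidean metric
    for _ in range(max_iter):
        p = kde_density(X, M, sigma)                     # densities under the current metric
        w = p / p.sum()                                  # convex weights (assumption)
        M = np.einsum("i,ide->de", w, local_metrics)     # M = sum_i w_i M_i
    return M
```

With identical local metrics the procedure returns that common metric regardless of the densities, which is a quick way to test an implementation.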

C Experimental details on metric learning for supervised learning tasks

C.1 Linear combination of local metrics

We compare eight metric learning methods, including two new methods described in Section B.

• Euclidean. We use the identity matrix as a metric to compute distances. 

• LMNN. We learn a single metric discriminatively using the large margin nearest neighbor method j^, as
reviewed in Section 2.

• GLM$^{\mathrm{INT}}$. We learn local metrics using generative techniques flT], as reviewed in Section 2. We use Gaussian
distributions as the class-conditional distributions for our generative modeling. We follow the procedure defined
in fllll to interpolate the learned local metrics with the Euclidean metric when we classify new data points. We
do not report the results of the un-interpolated metrics (GLMs), as our findings are consistent with those of the
authors of fTHl: the interpolated metrics perform much better.

• $M^{\mathrm{UNI}}$. This is our approach of combining the un-interpolated GLMs into a single metric with the uniform
combination, described in eq. (11) of Section 3.1.

• LMNN$_E$. With the same metric as LMNN, this method performs energy-based classification [5]. Loosely
speaking, the energy of a point being assigned to a class $c$ is defined as the difference between two quantities:
the (sum of) distances of this point to its nearest neighbors in class $c$ and the (sum of) distances of this point to its
nearest neighbors in classes other than $c$. A point is assigned to the class which has the lowest energy. According
to 0], energy-based classification sometimes improves performance significantly over purely nearest-neighbor-based
classification.

• $M^{\mathrm{GMM}}$. We learn the global metric as a weighted combination of local metrics. The weight is proportional to the
density estimated from the Gaussian Mixture Model. The number of iterations in Algorithm 1 is set to 20.

Table 4: Misclassification error rates (in %) on small-scale datasets

METHOD                  3-Normal        Wine            Iris            Heart
Euclidean               8.17 ± 0.33     4.41 ± 0.61     5.11 ± 0.69     22.78 ± 0.88
LMNN                    4.70 ± 0.30     2.61 ± 0.38     4.78 ± 0.71     21.91 ± 0.84
GLM$^{\mathrm{INT}}$    3.83 ± 0.18     3.96 ± 0.64     3.67 ± 0.54     21.60 ± 0.92
$M^{\mathrm{UNI}}$      3.44 ± 0.25     1.80 ± 0.40     3.33 ± 0.48     19.51 ± 0.90
$M^{\mathrm{GMM}}$      2.96 ± 0.21     2.97 ± 0.51     3.44 ± 0.54     19.69 ± 0.93
$M^{\mathrm{KDE}}$      3.09 ± 0.22     2.07 ± 0.46     3.33 ± 0.48     18.95 ± 0.90
LMNN$_E$                3.70 ± 0.24     2.43 ± 0.44     4.67 ± 0.54     20.56 ± 0.80
$M^{\mathrm{UNI}}_E$    3.10 ± 0.23     2.52 ± 0.52     3.11 ± 0.57     19.26 ± 0.76

METHOD                  Vehicle         Ionosphere      Image           German
Euclidean               31.29 ± 0.54    15.87 ± 0.68    2.68 ± 0.13     27.27 ± 0.60
LMNN                    21.97 ± 0.47    12.39 ± 0.57    2.14 ± 0.12     27.20 ± 0.58
GLM$^{\mathrm{INT}}$    25.32 ± 0.65    6.24 ± 0.52     2.89 ± 0.15     26.17 ± 0.47
$M^{\mathrm{UNI}}$      17.47 ± 0.30    9.67 ± 0.48     2.47 ± 0.13     26.12 ± 0.62
$M^{\mathrm{GMM}}$      18.15 ± 0.49    9.77 ± 0.49     2.27 ± 0.10     25.87 ± 0.68
$M^{\mathrm{KDE}}$      17.17 ± 0.33    9.01 ± 0.49     2.12 ± 0.12     26.10 ± 0.66
LMNN$_E$                20.37 ± 0.52    11.97 ± 0.66    2.67 ± 0.13     26.88 ± 0.55
$M^{\mathrm{UNI}}_E$    15.81 ± 0.52    10.80 ± 0.71    3.01 ± 0.16     25.15 ± 0.51



• $M^{\mathrm{KDE}}$. We learn the global metric as a weighted combination of local metrics. The weight is proportional to the
density estimated from the (Gaussian) Kernel Density Estimator. The number of iterations in Algorithm 1 is set
to 20.
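
The loose description of energy-based classification given in the list above can be sketched as follows. This follows only that loose description: it sums distances to the `k` nearest neighbors inside and outside each class and ignores the margin that the complete energy involves. The function names, the choice `k=3`, and the use of integer class labels are assumptions made for illustration.

```python
import numpy as np

def class_energies(x, X_train, y_train, M, k=3):
    """Energy of assigning x to each class, per the loose description only."""
    diff = X_train - x
    d2 = np.einsum("nd,de,ne->n", diff, M, diff)     # squared distances under M
    dist = np.sqrt(np.maximum(d2, 0.0))
    energies = {}
    for c in np.unique(y_train):
        in_c = y_train == c
        same = np.sort(dist[in_c])[:k].sum()         # k nearest neighbors in class c
        other = np.sort(dist[~in_c])[:k].sum()       # k nearest neighbors in other classes
        energies[int(c)] = same - other
    return energies

def energy_classify(x, X_train, y_train, M, k=3):
    """Assign x to the class with the lowest energy."""
    energies = class_energies(x, X_train, y_train, M, k)
    return min(energies, key=energies.get)
```

With `k=1` this rule coincides with plain nearest-neighbor classification, so any difference from $k$NN in this simplified form comes from summing over several neighbors.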

For energy-based classification, we need to set a quantity called "margin" j^. We have used the following procedure:

• Transform all samples by the learnt metric.

• For each sample, compute the difference between: a) the distance to its nearest neighbor in the same class; b) the
distance to its nearest neighbor in other classes.

• Compute the median of these differences and denote its value as $\gamma_0$. Consider $\beta\gamma_0$ as candidate margins, where
$\beta$ is a scaling factor tuned on the validation set.
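
The three steps above can be sketched as below. The sign convention (other-class distance minus same-class distance), the Cholesky factor used to transform the samples, and the requirement that every class has at least two samples are assumptions of this sketch; the list of scaling factors passed as `betas` is hypothetical.

```python
import numpy as np

def margin_candidates(X, y, M, betas):
    """Median nearest-neighbor distance gap and the scaled candidate margins."""
    L = np.linalg.cholesky(M).T                  # M = L^T L
    Z = X @ L.T                                  # transform every sample: z = L x
    diff = Z[:, None, :] - Z[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)               # a point is not its own neighbor
    same = y[:, None] == y[None, :]
    d_same = np.where(same, dist, np.inf).min(axis=1)
    d_other = np.where(~same, dist, np.inf).min(axis=1)
    gamma0 = np.median(d_other - d_same)         # sign convention is an assumption
    return gamma0, [b * gamma0 for b in betas]

X = np.array([[0.0], [1.0], [10.0], [11.0]])
y = np.array([0, 0, 1, 1])
gamma0, candidates = margin_candidates(X, y, np.eye(1), betas=[0.5, 1.0, 2.0])
print(gamma0, candidates)   # 8.5 [4.25, 8.5, 17.0]
```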

The error rates (mean and standard error) on small-scale datasets are listed in Table 4, with ranking information
shown in Table 5. We can see that on most datasets, $M^{\mathrm{UNI}}$ and $M^{\mathrm{UNI}}_E$ outperform GLM$^{\mathrm{INT}}$, LMNN, LMNN$_E$ and
Euclidean. Additionally, $M^{\mathrm{KDE}}$ performs better than $M^{\mathrm{UNI}}$, $M^{\mathrm{GMM}}$ and $M^{\mathrm{UNI}}_E$ in general. However, $M^{\mathrm{KDE}}$ is less
efficient than $M^{\mathrm{UNI}}$ due to its iterative nature. We plan to explore more efficient and theoretically sound combination
approaches in future work.
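
The per-dataset ranks can be reproduced from the error rates. The sketch below assumes competition ranking (rank 1 for the lowest error, tied methods sharing the better rank); this convention is inferred from the tables and is not stated in the text. The input is the Iris column of Table 4, in row order.

```python
def competition_ranks(errors):
    """Rank 1 = lowest error; tied methods share the better rank."""
    return [1 + sum(other < e for other in errors) for e in errors]

# Mean error rates (in %) for Iris, copied from Table 4 in row order.
iris = [5.11, 4.78, 3.67, 3.33, 3.44, 3.33, 4.67, 3.11]
print(competition_ranks(iris))   # [8, 7, 5, 2, 4, 2, 6, 1]
```

The two methods tied at 3.33 both receive rank 2, and the next method receives rank 4.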

The error rates (mean and standard error) on large-scale datasets are shown in Table 6. $M^{\mathrm{UNI}}$ and $M^{\mathrm{UNI}}_E$ generally
perform well. In particular, at the low dimensionality of 40, $M^{\mathrm{UNI}}_E$ reaches almost the same accuracy (error rate:
1.40%) as the discriminatively trained metric (LMNN$_E$) at the higher dimensionality of 164 (error rate: 1.34%).

C.1.1 Training speed 

The training times of LMNN and $M^{\mathrm{UNI}}$ on small-scale and large-scale datasets are given in Tables 7 and 8, respectively.
Note that the time reported here is the training time per tuning (i.e., one run with fixed parameters), which does not
count the time for parameter tuning (required in LMNN). Clearly, on most datasets, $M^{\mathrm{UNI}}$ achieves a speedup of one to
two orders of magnitude over LMNN. It is also interesting to point out that the scale of the features may affect the
training time of LMNN significantly (LMNN runs faster on Letters-original than on Letters-scaled), as it can change
the number of active constraints.

Table 5: Ranking of different methods

METHOD                  3-Norm.  Wine  Iris  Heart  Vehi.  Ionos.  Image  German  Avg. Rank
Euclidean               8        8     8     8      8      8       6      8       7.75
LMNN                    7        5     7     7      6      7       2      7       6
GLM$^{\mathrm{INT}}$    6        7     5     6      7      1       7      5       5.5
$M^{\mathrm{UNI}}$      4        1     2     3      3      3       4      4       3
$M^{\mathrm{GMM}}$      1        6     4     4      4      4       3      2       3.5
$M^{\mathrm{KDE}}$      2        2     2     1      2      2       1      3       1.87
LMNN$_E$                5        3     6     5      5      6       5      6       5.13
$M^{\mathrm{UNI}}_E$    3        4     1     2      1      5       8      1       3.13



Table 6: Misclassification error rates (in %) on large-scale datasets

METHOD                  MNIST-40        MNIST-60        MNIST-164       Letters-scaled  Letters-original
Euclidean               2.09 ± 0.02     2.02 ± 0.02     2.16 ± 0.01     5.05 ± 0.07     5.25 ± 0.08
LMNN                    1.99 ± 0.03     1.84 ± 0.01     1.82 ± 0.03     3.91 ± 0.08     3.81 ± 0.13
GLM$^{\mathrm{INT}}$    3.75 ± 0.06     3.55 ± 0.05     3.48 ± 0.09     5.55 ± 0.08     5.51 ± 0.08
$M^{\mathrm{UNI}}$      1.93 ± 0.03     1.90 ± 0.03     4.30 ± 0.04     3.04 ± 0.08     2.96 ± 0.09
LMNN$_E$                1.53 ± 0.01     1.43 ± 0.01     1.34 ± 0.01     2.98 ± 0.06     2.90 ± 0.06
$M^{\mathrm{UNI}}_E$    1.40 ± 0.01     1.44 ± 0.01     3.13 ± 0.03     2.26 ± 0.04     2.28 ± 0.06



C.2 Nonlinear combination of local metrics 

We learn the nonlinear combination of metrics in the framework of convex combination of nonlinear kernels [13, 23].
Our baseline systems learn a kernel of the following form

$$K(x, x') = \sum_l \alpha_l \exp\{-r_l \|x - x'\|_2^2 / \sigma_0\} \qquad (24)$$

where $\alpha_l$ are the coefficients of the convex combination and $\sigma_0$ is a normalization factor that fixes the scale of the kernel. The
"scaled" (inverse) bandwidth $r_l$ takes values from [2^'', 2~"' , 2^, 2**].

For the nonlinear combination of metrics, instead of using all local metrics, we consider $P$ "regional" metrics for the
sake of computational efficiency (the regional metrics are obtained by averaging the local metrics in each cluster). We
then use those regional metrics to compose the desired kernel with the formulation of eq. (13). We use the same scaling
scheme as for constructing the baseline in eq. (24) and adjust $\sigma_0$ accordingly (with respect to each different regional metric).
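
A minimal sketch of the two kernel constructions follows. The baseline follows eq. (24) directly. For the regional metrics, the sketch assumes that each base kernel is a Gaussian kernel on the squared distance under one regional metric; the exact form of eq. (13) is not reproduced here, so this is an assumed form. Cluster assignments (`cluster_ids`) are taken as given, every cluster is assumed non-empty, and any grid of $r_l$ values passed in is a placeholder, not the grid used in the experiments.

```python
import numpy as np

def sq_dists(X, Y, M=None):
    """Pairwise squared distances: Euclidean if M is None, else under metric M."""
    diff = X[:, None, :] - Y[None, :, :]
    if M is None:
        return (diff ** 2).sum(-1)
    return np.einsum("ijd,de,ije->ij", diff, M, diff)

def baseline_kernel(X, Y, alphas, rs, sigma0):
    """Eq. (24): convex combination of Gaussian kernels with inverse bandwidths r_l."""
    d2 = sq_dists(X, Y)
    return sum(a * np.exp(-r * d2 / sigma0) for a, r in zip(alphas, rs))

def regional_metrics(local_metrics, cluster_ids, P):
    """Average the local metrics of the points assigned to each of the P clusters."""
    return np.stack([local_metrics[cluster_ids == p].mean(axis=0) for p in range(P)])

def regional_kernels(X, Y, metrics, r, sigma0s):
    """One Gaussian base kernel per regional metric (assumed form, see lead-in)."""
    return np.stack([np.exp(-r * sq_dists(X, Y, M) / s)
                     for M, s in zip(metrics, sigma0s)])
```

The stack returned by `regional_kernels` is the set of base kernels whose convex combination is then learned.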

The combination coefficients are learnt in the framework of SimpleMKL [23], which minimizes the empirical
risk of the support vector machine (SVM). For simplicity, let us denote the combined kernel as $\sum_l \alpha_l K_l$, where
$\{K_l\}$ refers to the set of base kernels. SimpleMKL essentially solves the following optimization problem for binary
classification (it can be extended to multi-way classification using one-against-all or one-against-one approaches):

$$\min_{\alpha} \max_{\beta} \; \beta^\top e - \frac{1}{2} \beta^\top \Big( \sum_l \alpha_l K_l \Big) \beta \qquad (25)$$

$$\text{subject to} \quad \beta \ge 0, \; \beta^\top y = 0; \quad \alpha \ge 0, \; \alpha^\top e = 1 \qquad (26)$$

where $\beta$ is the vector containing the SVM dual variables, and $e$ refers to the all-ones vector.
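
For a fixed $\beta$, the objective in eq. (25) is linear in $\alpha$, so its partial derivative with respect to $\alpha_l$ is $-\frac{1}{2} \beta^\top K_l \beta$. The sketch below only evaluates the objective and this gradient; it does not solve the inner SVM problem or enforce the constraints in eq. (26), and the function names are illustrative. Following eq. (25) as written, the class labels do not appear explicitly in the quadratic term.

```python
import numpy as np

def mkl_objective(alpha, beta, kernels):
    """Value of the objective in eq. (25) for given alpha and beta."""
    K = np.tensordot(alpha, kernels, axes=1)     # sum_l alpha_l K_l
    return beta.sum() - 0.5 * beta @ K @ beta

def mkl_grad_alpha(beta, kernels):
    """Gradient with respect to alpha at fixed beta: -1/2 beta^T K_l beta for each l."""
    return -0.5 * np.einsum("i,lij,j->l", beta, kernels, beta)
```

In a gradient-based scheme of this kind, `beta` would come from an SVM solver run with the current combined kernel, and `alpha` would then be updated along a descent direction that keeps it on the simplex.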

SimpleMKL is an iterative numerical optimization procedure that optimizes the kernel combination coefficients
$\alpha_l$. The time needed to find the optimal solution depends on many factors, including the number of base kernels.
Thus, its computational cost can be substantial. In this respect, we view $M^{\mathrm{UNI}}$, i.e., simply averaging local metrics, as a
strong contender in both improving performance and computational efficiency.

Table 7: Training time per tuning (seconds) on small-scale datasets

METHOD                  3-Norm.  Wine   Iris   Heart  Vehi.   Ionos.  Image   German
LMNN                    195.3    1.9    3.5    3.2    126.6   12.1    160.7   8.2
$M^{\mathrm{UNI}}$      0.3      0.03   0.04   0.06   0.5     0.3     0.8     0.3
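
The speedups implied by Table 7 can be computed directly from its two rows. The dictionary names are illustrative; the second row is taken to be the uniform combination, as stated in the text on training speed.

```python
import math

# Training times per tuning in seconds, copied from Table 7.
lmnn = {"3-Norm.": 195.3, "Wine": 1.9, "Iris": 3.5, "Heart": 3.2,
        "Vehi.": 126.6, "Ionos.": 12.1, "Image": 160.7, "German": 8.2}
uniform = {"3-Norm.": 0.3, "Wine": 0.03, "Iris": 0.04, "Heart": 0.06,
           "Vehi.": 0.5, "Ionos.": 0.3, "Image": 0.8, "German": 0.3}

speedup = {name: lmnn[name] / uniform[name] for name in lmnn}
for name, ratio in speedup.items():
    # Report the ratio and its order of magnitude (base-10 logarithm).
    print(f"{name:8s} {ratio:7.1f}x  (10^{math.log10(ratio):.1f})")
```

Every ratio lies between 10 and 1000, that is, between one and three orders of magnitude.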



Table 8: Training time per tuning (minutes) on large-scale datasets

METHOD                  MNIST-40  MNIST-60  MNIST-164  Letters-scaled  Letters-original
LMNN                    42        55        215        14              3
$M^{\mathrm{UNI}}$      4         7         47         1               1



The misclassification error rates are shown in Table 9, where Local (P) denotes our nonlinear combination
method with $P$ regional metrics. Across the 7 datasets we experimented with, the nonlinear combination of metrics often
outperforms the baseline, with significant improvement on the 3-Normal, Iris and Vehicle datasets. We observe that
although nonlinearly combining local metrics is promising, choosing the right number of local kernels is important. (It is
often impractical to combine local kernels defined on all data points due to the heavy computational cost.)

D Application to unsupervised learning 

Many unsupervised learning problems, such as clustering and dimensionality reduction, also depend on using a proper 
metric to calculate distances. We investigate how to apply the supervised metric learning algorithms to such problems. 
The crucial step is to extract labels for the algorithms to compute better metrics. 

In unsupervised clustering, we start with $k$-means clustering with the Euclidean distance metric. We then treat
the cluster labels as if they were ground-truth class labels. We apply the generative metric learning algorithm to such
"labeled" data, and compute local metrics and then the global metric. We then apply $k$-means clustering again, using the
learnt global metric to compute distances. We iterate the process a few times or until the cluster labels no longer change.
Our empirical study shows that this simple strategy works well. The algorithm returns clusterings of higher quality,
as assessed by standard measures, than $k$-means. Note that being able to have a global metric is essential. Without it, it
is difficult to compare distances measured in different (local) metrics.
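
The alternation between clustering and metric learning can be sketched as follows. The metric learner is passed in as a callable `learn_global_metric(X, labels)`; the function `within_cluster_metric` is a hypothetical placeholder (the inverse pooled within-cluster covariance) used only to make the sketch runnable, and it is not the generative metric learning method of this paper. The plain Lloyd iteration, the fixed seed and `n_rounds=5` are likewise assumptions.

```python
import numpy as np

def kmeans(X, k, M, n_iter=50, seed=0):
    """Lloyd's k-means with distances measured under the metric M."""
    rng = np.random.default_rng(seed)
    C = np.linalg.cholesky(M)            # M = C C^T
    Z = X @ C                            # Euclidean distance in Z equals M-distance in X
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):      # keep the old center if a cluster empties
                centers[c] = Z[labels == c].mean(axis=0)
    return labels

def cluster_with_metric(X, k, learn_global_metric, n_rounds=5):
    """Alternate k-means and metric learning on the resulting cluster labels."""
    M = np.eye(X.shape[1])               # start from the Euclidean metric
    labels = kmeans(X, k, M)
    for _ in range(n_rounds):
        M = learn_global_metric(X, labels)    # cluster labels act as class labels
        new_labels = kmeans(X, k, M)
        if np.array_equal(new_labels, labels):
            break                        # labels no longer change
        labels = new_labels
    return labels, M

def within_cluster_metric(X, labels, reg=1e-3):
    """Placeholder learner (NOT the paper's method): inverse within-cluster covariance."""
    S = reg * np.eye(X.shape[1])
    for c in np.unique(labels):
        Xc = X[labels == c] - X[labels == c].mean(axis=0)
        S += Xc.T @ Xc / len(X)
    return np.linalg.inv(S)
```

A call such as `cluster_with_metric(X, 2, within_cluster_metric)` returns the final labels together with the last learned metric. The stopping test compares raw label arrays, so a pure relabeling of the same clusters counts as a change; the bound `n_rounds` covers that case.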

We demonstrate the usage of our global metric in the clustering problem. We use the small-scale datasets from the
previous section, and try $k$-means clustering with the metric from $M^{\mathrm{UNI}}$, as well as with the standard Euclidean metric.
We set $k$ equal to the number of classes, and iteratively obtain labels and metrics, as described previously.

The clustering results are measured by the RAND score. It is a similarity measure between two label sets, where
the maximum of 1 indicates that the two label sets give exactly the same clustering. We calculate the RAND score
based on the cluster assignment returned by $k$-means and the true labels of these datasets.
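
For reference, a minimal sketch of the score, assuming that "RAND score" denotes the standard (unadjusted) Rand index: the fraction of point pairs on which the two labelings agree, where agreement means the pair is grouped together in both or separated in both.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs treated the same way by both labelings."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))   # 1.0: same clustering, labels renamed
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))   # 2 of 6 pairs agree
```

The score depends only on the grouping, not on the label names, which is why it can compare cluster indices against class labels.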

For unsupervised learning, we find that it is useful to regularize covariance matrices and interpolate local metrics
before computing the global metric. Indeed, regularization and interpolation can prevent overfitting, and generally lead
to better clustering results. To tune these parameters, we split each dataset into training, validation and testing sets with
a ratio of 60/20/20. We adopt the following procedure for parameter tuning: for each parameter combination, we learn
clusters on the training set and use them to cluster the validation set. Then, we compute the RAND score on the
validation set to measure the performance of the current parameter combination. Finally, we use the best parameter
combination to learn the clusters on the training set, and use them to cluster the testing set. We report the RAND score
on the testing set as an indicator of the model performance.
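
The tuning protocol can be sketched generically. The callables `fit(train, params)` and `score(model, data)` are hypothetical placeholders for "learn clusters on the training set" and "cluster a held-out set and compute its RAND score"; they are not defined in the text, and the shuffling seed is an assumption.

```python
import random

def split_60_20_20(n, seed=0):
    """Shuffle the indices 0..n-1 and split them 60/20/20."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    a, b = n * 6 // 10, n * 8 // 10
    return idx[:a], idx[a:b], idx[b:]

def tune_and_test(param_grid, fit, score, train, val, test):
    """Pick the parameters with the best validation score, then report the test score."""
    best = max(param_grid, key=lambda p: score(fit(train, p), val))
    return best, score(fit(train, best), test)
```

Repeating this over several random splits (different `seed` values) and averaging the test scores gives figures of the kind reported per dataset.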

Table 10 gives the RAND scores of different methods, averaged over 30 splits. We find that $k$-means + $M^{\mathrm{UNI}}$
performs better than $k$-means + Euclidean on 5 (out of 8) datasets, with significant improvement on the Iris and Image
datasets. However, metric learning can also have a negative effect, as seen on the Heart and Wine datasets.

Table 9: Misclassification error rates (in %) with nonlinear combination of metrics

METHOD                  3-Normal        Wine            Iris            Heart
Baseline                6.08 ± 0.27     2.16 ± 0.40     5.56 ± 0.65     18.09 ± 0.71
Local (P = 1)           3.36 ± 0.26     2.43 ± 0.49     3.33 ± 0.60     17.16 ± 0.57
Local (P = 5)           2.72 ± 0.26     4.14 ± 0.56     2.78 ± 0.36     19.26 ± 0.82
Local (P = 10)          2.31 ± 0.24     3.96 ± 0.70     2.89 ± 0.47     18.95 ± 0.79

METHOD                  Vehi.           Ionos.          German
Baseline                26.53 ± 0.53    6.90 ± 0.43     25.22 ± 0.38
Local (P = 1)           15.56 ± 0.45    9.01 ± 0.58     24.38 ± 0.44
Local (P = 5)           15.44 ± 0.42    10.47 ± 0.65    24.30 ± 0.42
Local (P = 10)          15.58 ± 0.47    9.81 ± 0.53     24.25 ± 0.40



Table 10: RAND score on small-scale datasets

METHOD                            3-Normal         Iris             Wine             Heart
$k$-means + Euclidean             0.528 ± 0.003    0.878 ± 0.009    0.935 ± 0.008    0.654 ± 0.008
$k$-means + $M^{\mathrm{UNI}}$    0.545 ± 0.008    0.948 ± 0.006    0.930 ± 0.008    0.643 ± 0.007

METHOD                            Vehi.            Ionos.           Image            German
$k$-means + Euclidean             0.651 ± 0.001    0.565 ± 0.007    0.510 ± 0.002    0.500 ± 0.0005
$k$-means + $M^{\mathrm{UNI}}$    0.657 ± 0.002    0.569 ± 0.008    0.567 ± 0.006    0.500 ± 0.0005